Add validation for CardinalityEffect::equal when executing a plan#20875
Add validation for CardinalityEffect::equal when executing a plan#20875xanderbailey wants to merge 3 commits into
Conversation
| /// number of output rows as their input. This is a post-execution | ||
| /// sanity check useful for debugging correctness issues. | ||
| /// Disabled by default as it adds a small amount of overhead. | ||
| pub verify_cardinality_effect: bool, default = false |
There was a problem hiding this comment.
Happy to have this just be on by default... The cost IMO is pretty small, it's a single tree walk and an extra Arc::clone. It may also be that there are current bugs in CardinalityEffect::Equal declaration?
There was a problem hiding this comment.
Can you check whether enabling by default will pass all the (unit/slt/...) tests?
I wonder if all the ExecutionPlan update the metrics correctly after executing (would be a great way anyway of verifying this).
There was a problem hiding this comment.
Let me give that a go today!
|
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Which issue does this PR close?
N/A — preventive measure inspired by #20683 / #20672 where
RepartitionExecsilently dropped rows when spilling under memory pressure.Rationale for this change
The bug fixed in #20672 (repartition dropping data when spilling) was a silent correctness issue — queries returned wrong results with no error. This class of bug is particularly dangerous because it can go undetected.
DataFusion operators already declare their expected cardinality effect via
CardinalityEffect::Equal(meaning "I output exactly as many rows as I receive"), but nothing actually verified this at runtime. This PR adds a post-execution sanity check that catches any such violation, so bugs like #20683 can be detected immediately rather than silently producing wrong results.What changes are included in this PR?
datafusion.execution.verify_cardinality_effect(default:false) inExecutionOptionscardinality_checkmodule indatafusion-physical-planwith avalidate_cardinality_effect()function that walks the executed plan tree usingExecutionPlanVisitorand verifies that every operator declaringCardinalityEffect::Equalproduced the same number of output rows as its input (based on post-execution metrics)collect()andcollect_partitioned()that run the validation after all streams are consumed, when the config flag is enabled#[derive(Debug, Clone, Copy)]onCardinalityEffect— the enum was missing these basic derivesThe check gracefully skips operators where metrics are unavailable, where
fetchlimits are set, or where the operator is not unary (one in - one out). When a violation is detected, it returnsDataFusionError::Internalwith the operator name and mismatched row counts.Are these changes tested?
Yes. Six unit tests cover:
fetchsetAre there any user-facing changes?
New config option
datafusion.execution.verify_cardinality_effect(defaultfalse). When enabled viaSET execution.verify_cardinality_effect = true, DataFusion will validate row count invariants after query execution and return an error if any operator that should preserve row counts (likeRepartitionExec,ProjectionExec,CoalesceBatchesExec,SortExec, etc.) produces a different number of rows than its input.